Setup

Packages Used

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(waffle)
library(plotly)

Datasets Used

Background of the Data

The datasets were downloaded from two Kaggle pages: "Disney+ Movies and TV Shows" and "TV shows on Netflix, Prime Video, Hulu and Disney+". You can read about them there, including variable definitions, sources, when they were created, and other information. Load the two datasets and use glimpse() to explore their structures.

Disney+ Dataset

disneydata <- readRDS("/home/students/lyonscf/STT 2860/STT2860F22project3/data/disneypluscontent.rds")
glimpse(disneydata)  # show column names, types, and the first few values

Streaming Dataset

streamingdata <- readRDS("/home/students/lyonscf/STT 2860/STT2860F22project3/data/streamingcontent.rds")
glimpse(streamingdata)  # show column names, types, and the first few values

Analyses

Create the Analysis Datasets

Disney+ Dataset

Use select() to delete the variables director, cast, country, listed_in, and description from the dataset.

disneyedits <- disneydata %>%
  select(show_id, type, title, date_added, release_year, rating, duration, duration_unit)
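
The code above keeps the wanted columns by name. The same result can also be reached by naming the columns to drop, which follows the wording of the instruction more closely. The following is a minimal sketch under stated assumptions: toy_disney is a hypothetical stand-in with invented values and only some of the real columns, not the project data.

```r
library(dplyr)

# Hypothetical stand-in for disneydata; all values are invented
toy_disney <- tibble(
  show_id = c("s1", "s2"),
  type = c("Movie", "TV Show"),
  title = c("Example Movie", "Example Show"),
  director = c("A. Director", NA),
  cast = c("Actor One", "Actor Two"),
  country = c("United States", NA),
  listed_in = c("Family", "Animation"),
  description = c("A made-up movie.", "A made-up show.")
)

# A minus sign in front of c() drops the listed columns and keeps the rest
toy_edits <- toy_disney %>%
  select(-c(director, cast, country, listed_in, description))

names(toy_edits)
```

Dropping by name has the advantage that any column not listed is kept automatically, so the call does not need to change if the source file gains extra variables.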

Streaming Dataset

I used a function called pivot_longer() on the downloaded data to change the shape of the dataset. You will need to do additional necessary editing on the dataset before you analyze it.

  • Use filter() to remove any row where YesNo is 0 (a 0 means it is not on the service).
  • Use the separate() function to split IMDb. Separate the show rating from the max rating of 10.
  • Use the separate() function to split RottenTomatoes. Separate the show rating from the max rating of 100.
  • Use mutate() to convert the shows’ IMDb and Rotten Tomatoes ratings from categorical into numerical variables.

streamingedits <- streamingdata %>%
  filter(YesNo != 0) %>%
  separate(col = IMDb, into = c('IMDbRating', 'IMDbMaxRating'), sep='/') %>%
  separate(col = RottenTomatoes, into = c('RottenTomatoesRating', 'RottenTomatoesMaxRating'), 
           sep='/') %>%
  mutate(IMDbRating = as.numeric(IMDbRating),
         IMDbMaxRating = as.numeric(IMDbMaxRating),
         RottenTomatoesRating = as.numeric(RottenTomatoesRating),
         RottenTomatoesMaxRating = as.numeric(RottenTomatoesMaxRating))
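
The filter, split, and convert steps listed above can be tried on a few rows before running them on the full dataset. This is a sketch only: toy_streaming is a hypothetical tibble with invented values in the same "score/max" text format, and it uses the convert = TRUE argument of separate() as an alternative to the explicit mutate() step.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows; titles and scores are invented for illustration
toy_streaming <- tibble(
  Title = c("Show A", "Show B", "Show C"),
  IMDb = c("8.5/10", "7.1/10", NA),
  RottenTomatoes = c("95/100", "60/100", "88/100"),
  Service = c("Netflix", "Hulu", "Disney+"),
  YesNo = c(1, 0, 1)
)

toy_edits <- toy_streaming %>%
  # Keep only rows where the show is on the service
  filter(YesNo != 0) %>%
  # convert = TRUE turns the split text pieces into numbers
  separate(IMDb, into = c("IMDbRating", "IMDbMaxRating"),
           sep = "/", convert = TRUE) %>%
  separate(RottenTomatoes,
           into = c("RottenTomatoesRating", "RottenTomatoesMaxRating"),
           sep = "/", convert = TRUE)

glimpse(toy_edits)
```

With convert = TRUE, columns that hold only whole numbers come back as integer rather than double, and a missing rating stays NA in both new columns. The explicit as.numeric() calls used in the project code give double columns throughout, which may be preferable when a consistent type matters.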

Visualization 1: Release Year

These plots use the Disney+ Dataset.

A frequency polygon (geom_freqpoly()) is an alternative to a histogram. Rather than displaying bars, it connects the midpoints of a histogram’s bars with line segments. Create a frequency polygon for the year in which Disney+ content was released. Add an appropriate title and axis labels. Use other formatting as you choose to enhance effectiveness/appearance.

ggplot(disneyedits, aes(release_year)) +
  geom_freqpoly() +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10)) +
  labs(x = "Release Year", y = "Release Count", title = "Disney Content Released") +
  theme_linedraw()
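
To see how a frequency polygon relates to a histogram, both layers can be drawn on the same axes with the same bin width. The sketch below assumes invented release years (toy_years is hypothetical) and a bin width of 10 years chosen only for illustration.

```r
library(ggplot2)

# Invented release years, for illustration only
toy_years <- data.frame(
  release_year = c(1950, 1955, 1990, 1995, 1998,
                   2005, 2010, 2012, 2015, 2019, 2019)
)

# Using the same binwidth in both layers makes the bins line up,
# so each polygon vertex sits at the top center of a histogram bar
p <- ggplot(toy_years, aes(release_year)) +
  geom_histogram(binwidth = 10, fill = "grey80", color = "white") +
  geom_freqpoly(binwidth = 10, color = "#702963") +
  labs(x = "Release Year", y = "Release Count")

p
```

Setting binwidth explicitly also avoids the message that ggplot2 prints when it falls back to its default of 30 bins.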

Create a violin plot of release_year (x-axis) grouped by type of program (y-axis) for content on Disney+. Fill with a color of your choice. Add a boxplot inside the violin plot, as you did in one of the DataCamp exercises. Re-scale the x-axis so that tick marks appear at whole-decade intervals (e.g., 1980, 1990, 2000). Add an appropriate title and axis labels. Use other formatting as you choose to enhance effectiveness/appearance.

ggplot(disneyedits, aes(x = release_year, y = type)) +
  geom_violin(trim = FALSE, fill ='#702963')+
  geom_boxplot(width = 0.1) + 
  theme_minimal() +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10)) +
  labs(x = "Release Year", y = "", title = "Disney Content Releases")

Visualization 2: Program Type

This plot uses the Disney+ Dataset.

Create a waffle plot (which you learned in DataCamp: Visualization Best Practices in R) to display the distribution of program type on Disney+.

  • Give the plot the title “Streaming Content on Disney+”.
  • Change the colors of the squares to something other than the defaults.
  • Use an x-axis label to indicate roughly how many programs each square represents.

Hint: Use round(100 * prop.table(table(DATASETNAME$VARIABLENAME))) to create the “case_counts” data for the waffle plot. Swap out the capital letter placeholders in the instructions for the correct dataset name and variable name.

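case_counts <- round(100 * prop.table(table(disneyedits$type)))
# Each of the roughly 100 squares stands for about 1% of the programs
programs_per_square <- round(nrow(disneyedits) / 100)
waffle(case_counts) +
  labs(title = "Streaming Content on Disney+",
       x = paste("1 square =", programs_per_square, "programs (approx.)")) +
  scale_fill_manual(values = c("#702963", "black")) +
  guides(fill = guide_legend(title = "Type"))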
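
The hint's expression turns raw counts into whole-number percentages, which is why the waffle plot has about 100 squares. The steps can be followed one at a time in base R. This sketch uses an invented vector of program types (toy_type is hypothetical), not the Disney+ data.

```r
# Invented program types, for illustration only
toy_type <- c(rep("Movie", 30), rep("TV Show", 10))

counts <- table(toy_type)            # raw counts per type
shares <- prop.table(counts)         # proportions that sum to 1
case_counts <- round(100 * shares)   # whole-number percentages

case_counts

# Approximate number of programs represented by one square
length(toy_type) / 100
```

Because each percentage is rounded separately, the total will not always be exactly 100 for other data, so the number of squares in the plot can be off by one or two.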

Visualization 3: Choose a Plot!

This plot uses the Disney+ Dataset.

Create one other plot of your choice from the Disney+ Dataset to explore a question of interest. You are welcome to perform additional manipulations on the data, if needed. Add an appropriate title and axis labels, as well as any other necessary formatting.

disneymovies <- disneyedits %>%
  filter(type == "Movie") %>%
  mutate(duration = as.numeric(duration))
disneychoice <- ggplot(disneymovies,
                       aes(x = release_year, y = duration,
                           text = paste("Title:", title,
                                        "<br> Release Year:", release_year,
                                        "<br> Minutes:", duration))) +
  geom_point() +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10)) +
  scale_y_continuous(breaks = seq(0, 180, by = 30)) +
  labs(x = "Release Year", y = "Duration (min)", title = "Disney+ Movies") +
  theme_minimal()
ggplotly(disneychoice, tooltip = "text")
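
The hover text in the plot above comes from the text aesthetic, which ggplot2 itself does not draw but ggplotly() picks up when tooltip = "text" is set. A stripped-down sketch of the same mechanism follows, assuming a hypothetical data frame toy_movies with invented titles and durations.

```r
library(ggplot2)
library(plotly)

# Invented movies, for illustration only
toy_movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  release_year = c(1965, 1999, 2019),
  duration = c(120, 95, 150)
)

# The text aesthetic holds the tooltip string; <br> starts a new line in it
p <- ggplot(toy_movies,
            aes(x = release_year, y = duration,
                text = paste0("Title: ", title,
                              "<br>Minutes: ", duration))) +
  geom_point()

# tooltip = "text" shows only the custom string when hovering over a point
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
```

ggplot2 may warn that text is an unknown aesthetic for geom_point(). That warning is expected here, since the aesthetic is only used by plotly.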

Visualization 4: Content Volume

This plot uses the Streaming Dataset.

Create a barplot to display how many shows are offered on each of the four streaming services. Choose appropriate colors, labels, themes, and/or other types of formatting that you feel will enhance the meaning or visual appearance of the plot.

ggplot(streamingedits, aes(Service, fill = Service)) +
  geom_bar(width = .25) +
  labs(title = "Streaming Service Shows", y = "Count") +
  scale_fill_manual(values = c('#153866', '#66aa33', '#E50914', '#00A8E1')) +
  theme_minimal()

Visualization 5: Choose a Plot!

This plot uses the Streaming Dataset.

Create one other plot of your choice from the Streaming Dataset to explore a question of interest. You are welcome to perform additional manipulations on the data, if needed. Add an appropriate title and axis labels, as well as any other necessary formatting.

bestshows <- streamingedits %>%
  filter(RottenTomatoesRating >= 90)
ggplot(bestshows, aes(Service, fill = Service)) +
  geom_bar(width = .25) +
  scale_y_continuous(breaks = seq(0, 17, by = 1)) +
  labs(title = "90+ Rated Streaming Service Shows", y = "Count") +
  scale_fill_manual(values = c('#153866', '#66aa33', '#E50914', '#00A8E1')) +
  theme_minimal()
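
The bar heights in the plot above can also be read off as a small table, which makes it easier to compare services exactly. This is a sketch under stated assumptions: toy_streamingedits is a hypothetical tibble with invented shows and scores, and n_shows is an illustrative column name.

```r
library(dplyr)

# Invented shows and ratings, for illustration only
toy_streamingedits <- tibble(
  Title = c("Show A", "Show B", "Show C", "Show D", "Show E"),
  Service = c("Netflix", "Hulu", "Netflix", "Disney+", "Hulu"),
  RottenTomatoesRating = c(95, 91, 92, 90, 88)
)

# Same filter as the plot, then one row per service with its show count
top_counts <- toy_streamingedits %>%
  filter(RottenTomatoesRating >= 90) %>%
  count(Service, name = "n_shows", sort = TRUE)

top_counts
```

A service with no shows at or above the cutoff does not appear in this table at all, and it would also be missing from the bar plot, which is worth keeping in mind when matching manual fill colors to services.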


Questions

Question 1: Based on your plots, make five informational statements or comparisons regarding the Disney+ streaming service.

ANSWER

  1. Disney content releases spiked from 2010 to 2021.

  2. According to Rotten Tomatoes, Hulu and Netflix offer the best-rated shows.

  3. Netflix offers the widest variety of shows.

  4. Avengers: Endgame is the longest movie available to stream on Disney+.

  5. Disney+ content consists of more movies than TV shows.

Question 2: What other data would you like to have, or which existing variables would you like to see transformed, if you were going to do further explorations or visualizations? Give at least two examples.

ANSWER

I would like to have movie data alongside the streaming shows data, so the analysis could show which service ranks best across all of its content. I would also like to have the revenue produced by each movie and TV show. It would be interesting to compare the best movies and TV shows.

Question 3: Explain the rationale behind the choices you made with regard to plot type, formatting, and so on, when you created Visualizations 3 and 5. Walk me through your process. What motivated your decisions?

ANSWER

With Visualization 3, I wanted to create a convenient and comprehensive plotly plot similar to the plotly plots made with App State Baseball. Hovering over each data point shows information that a static plot cannot. I wanted to see the outliers and whether duration correlates with release year. I was surprised that Disney+ offers long movies dating back to the 1960s, such as The Sound of Music. However, the longest movie was released in 2019 (Avengers: Endgame).

With Visualization 5, I wanted to see which streaming service had the best-rated shows according to Rotten Tomatoes, filtering for ratings of 90 or higher. I chose Rotten Tomatoes rather than IMDb because of its well-known critic ratings. It is extremely difficult to receive an “A” rating. With that said, and not to my surprise, Hulu and Netflix had the same number of “A”-rated shows.


sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plotly_4.10.1 waffle_0.7.0  ggplot2_3.4.0 tidyr_1.2.1   dplyr_1.0.10 
[6] readr_2.1.3  

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0   xfun_0.35          bslib_0.4.1        purrr_0.3.5       
 [5] colorspace_2.0-3   vctrs_0.5.1        generics_0.1.3     viridisLite_0.4.1 
 [9] htmltools_0.5.3    yaml_2.3.6         utf8_1.2.2         rlang_1.0.6       
[13] jquerylib_0.1.4    pillar_1.8.1       glue_1.6.2         withr_2.5.0       
[17] DBI_1.1.3          RColorBrewer_1.1-3 lifecycle_1.0.3    stringr_1.5.0     
[21] munsell_0.5.0      gtable_0.3.1       htmlwidgets_1.5.4  evaluate_0.18     
[25] labeling_0.4.2     knitr_1.41         tzdb_0.3.0         fastmap_1.1.0     
[29] extrafont_0.18     crosstalk_1.2.0    fansi_1.0.3        highr_0.9         
[33] Rttf2pt1_1.3.11    scales_1.2.1       cachem_1.0.6       jsonlite_1.8.3    
[37] farver_2.1.1       gridExtra_2.3      hms_1.1.2          digest_0.6.30     
[41] stringi_1.7.8      grid_3.6.0         cli_3.4.1          tools_3.6.0       
[45] magrittr_2.0.3     sass_0.4.4         lazyeval_0.2.2     tibble_3.1.8      
[49] extrafontdb_1.0    pkgconfig_2.0.3    ellipsis_0.3.2     data.table_1.14.6 
[53] assertthat_0.2.1   rmarkdown_2.18     httr_1.4.4         rstudioapi_0.14   
[57] R6_2.5.1           compiler_3.6.0